Multiplicative Weights

In this lecture, we will study various applications of the theory of Multiplicative Weights (MW). In this 
section, we briefly review the general version of the MW algorithm that we studied in the previous lecture. 
The following sections then show how the theory can be applied to approximately solve zero-sum games and 
linear programs, and how it connects with the theory of boosting and approximation algorithms. 

We have $n$ experts who predict the outcome of an event in consecutive rounds. Suppose that in each
round the outcome of the event is one of a set $P$ of possible outcomes. If outcome $j$ is realized, expert $i$ pays a penalty of $M(i,j)$.
An important parameter will prove to be the maximum allowable magnitude of the penalty. To capture it, let
$M(i,j) \in [-\ell, \rho]$, with $0 \le \ell \le \rho$, where $\rho$ is the width. Our goal is to devise a strategy that dictates which
expert's recommendation to follow, in order to achieve an expected average penalty that is not much worse
than that of the best expert (in hindsight).

The strategy that we analyzed in the previous lecture is as follows. We maintain for each expert a 
scalar weight, which can be thought of as a quality score. Then, at each round we choose to follow the 
recommendation of a specific expert with probability that is proportional to her weight. After the outcome 
is realized, we update the weight of each expert accordingly. In mathematical terms, let $w_i^t$ be the weight
of the $i$th expert at the beginning of round $t$. Then, the MW algorithm is

0. Initialize $w_i^1 = 1$, for all $i$.

1. At step t, 

a. Follow the recommendation of the $i$th expert with probability $p_i^t$, where

$$p_i^t = \frac{w_i^t}{\sum_j w_j^t}.$$

b. Let $j^t \in P$ denote the outcome of the event at round $t$, and $D^t = \{p_1^t, \ldots, p_n^t\}$ the distribution we
used above to select an expert. We pay the penalty $M(i, j^t)$ of the selected expert $i$; its expectation
over the choice of $i$ is denoted by $M(D^t, j^t) = \sum_i p_i^t M(i, j^t)$.

c. Update the weights as follows: 

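$$w_i^{t+1} = \begin{cases} w_i^t (1-\epsilon)^{M(i,j^t)/\rho} & \text{if } M(i,j^t) \ge 0, \\ w_i^t (1+\epsilon)^{-M(i,j^t)/\rho} & \text{if } M(i,j^t) < 0. \end{cases}$$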
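The steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `mw_update` and `run_mw` are hypothetical, and the penalties are assumed to be supplied as a fixed list of vectors with entries in $[-\rho, \rho]$. Rather than sampling an expert, the sketch accumulates the expected penalty $\sum_i p_i^t M(i, j^t)$ of each round.

```python
def mw_update(weights, penalties, eps, rho):
    """One MW weight update; penalties[i] plays the role of M(i, j^t) in [-rho, rho]."""
    new_weights = []
    for w, m in zip(weights, penalties):
        if m >= 0:
            new_weights.append(w * (1 - eps) ** (m / rho))
        else:
            new_weights.append(w * (1 + eps) ** (-m / rho))
    return new_weights


def run_mw(penalty_rounds, eps, rho):
    """Run MW over a list of penalty vectors; return the total expected penalty."""
    n = len(penalty_rounds[0])
    weights = [1.0] * n                      # step 0: all weights start at 1
    total = 0.0
    for penalties in penalty_rounds:
        z = sum(weights)
        probs = [w / z for w in weights]     # step 1a: p_i^t proportional to w_i^t
        total += sum(p * m for p, m in zip(probs, penalties))  # step 1b
        weights = mw_update(weights, penalties, eps, rho)      # step 1c
    return total


# Toy run (illustrative data): expert 0 is consistently the best one.
rounds = [[0.2, float(t % 2), 0.9] for t in range(1000)]
average_penalty = run_mw(rounds, eps=0.1, rho=1.0) / len(rounds)
```

In this toy run the best expert pays 0.2 per round, so the average penalty of the algorithm should end up close to that value.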

In the previous lecture we argued that for any $\delta > 0$, for a sufficiently small $\epsilon$ (at most $1/2$ and on the order of $\delta/\rho$), after $T = O(\rho^2 \log(n)/\delta^2)$ rounds
and for all $i$, the average penalty we get per round obeys

$$\frac{\sum_{t=1}^T M(D^t, j^t)}{T} \le \delta + \frac{\sum_{t=1}^T M(i, j^t)}{T}.$$

In particular, our average penalty per round is at most $\delta$ larger than the average penalty of the best expert.

Zero-Sum Games



There are two players, labeled as the "row" player R and the "column" player C. Each player has a finite 
set of actions that he can follow. At each round, player R pays player C an amount that depends on the 
actions of the two players. In particular, if R plays action $i$ and C plays $j$, the payoff from R to C is $M(i,j)$.
We assume that the payoffs are normalized, so that $M(i,j) \in [0, 1]$. Naturally, player R tries to minimize
the payoff, whereas player C tries to maximize it.

Each player can follow a pure strategy, which dictates a single action to be played repeatedly, or a
mixed strategy, under which the player has a fixed probability distribution over actions and chooses actions
randomly according to it. One might expect the order in which the players choose their actions to play
a role, since knowledge of your opponent's strategy helps you adapt your own strategy appropriately. If we
let $D$ and $P$ be the row and column mixed strategies respectively, von Neumann's minimax theorem
says that in this game the order of choosing actions does not matter to the players. Mathematically,

$$\lambda^* := \min_D \max_j M(D, j) = \max_P \min_i M(i, P),$$

where $\lambda^*$ is the so-called value of the game. Our goal is to approximate this value up to some additive error
$\delta$.

We deploy the MW algorithm as follows. Let pure strategies for R correspond to experts, and pure
strategies for C correspond to events. Then the penalty paid by expert $i$ in case of event $j$ is exactly the
payoff from R to C if they follow strategies $i$ and $j$ respectively, that is, $M(i,j)$. Assume also that for a
mixed strategy $D$, we can efficiently compute the column strategy $j$ that maximizes $M(D,j)$ (a quantity
that is always $\ge \lambda^*$). At step $t$ of the MW algorithm, we choose a distribution $D^t$ over experts, which
corresponds to a mixed strategy for R. Given $D^t$, we compute the worst possible event, which is the column
strategy $j^t$ that maximizes $M(D^t, j)$.

To see why this approach yields an approximation to $\lambda^*$, first note that for any distribution $D$,

$$\sum_t M(D, j^t) \ge \min_i \sum_t M(i, j^t), \qquad (1)$$

since a distribution is just a weighted average of pure strategies. Furthermore, as we argued above, we have

$$M(D^t, j^t) \ge \lambda^*, \qquad (2)$$

since we pick the payoff-maximizing column strategy. According to the MW theory, after $T = O(\log(n)/\delta^2)$
rounds (the width here is 1) and for any distribution $D$ we have

$$\lambda^* \le \frac{\sum_t M(D^t, j^t)}{T} \le \delta + \min_i \frac{\sum_t M(i, j^t)}{T} \le \delta + \frac{\sum_t M(D, j^t)}{T}.$$

The first inequality follows from (2), the second is the MW guarantee, and the third follows from (1). Since the above is true for any distribution $D$,
it is also true for the optimal row strategy $D^*$, which satisfies $M(D^*, j) \le \lambda^*$ for every $j$, and hence

$$\lambda^* \le \frac{\sum_t M(D^t, j^t)}{T} \le \delta + \lambda^*.$$

This demonstrates that the average penalty of the algorithm approximates the value of the game
to within a positive additive term of $\delta$. Note also that the average mixed strategy, or the best of the strategies
$D^t$, is itself an approximately optimal strategy, since its payoff against an optimally acting
opponent is approximately the value of the game.
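As a minimal sketch of this procedure, the following Python function runs MW against a best-responding column player. The name `approx_game_value` is hypothetical, and the parameter choices $\epsilon = \delta/2$ and $T = \lceil 4 \ln(n)/\delta^2 \rceil$ are assumptions made for this sketch (one choice for which the standard bound for penalties in $[0,1]$ gives additive error at most $\delta$); they are not necessarily the constants used in the lecture. It assumes at least two row actions and $0 < \delta \le 1$.

```python
import math


def approx_game_value(M, delta):
    """Approximate the value of a zero-sum game with payoffs M[i][j] in [0, 1].

    Returns (average payoff of the algorithm, average row strategy).
    Assumes len(M) >= 2 and 0 < delta <= 1.
    """
    n, m = len(M), len(M[0])
    eps = delta / 2                               # assumed parameter choice
    T = math.ceil(4 * math.log(n) / delta ** 2)   # assumed number of rounds
    weights = [1.0] * n
    avg_strategy = [0.0] * n
    total = 0.0
    for _ in range(T):
        z = sum(weights)
        D = [w / z for w in weights]              # mixed strategy D^t for R
        # Column player's best response j^t: maximize M(D^t, j).
        payoffs = [sum(D[i] * M[i][j] for i in range(n)) for j in range(m)]
        j_t = max(range(m), key=lambda j: payoffs[j])
        total += payoffs[j_t]
        for i in range(n):
            avg_strategy[i] += D[i] / T
        # MW update with width 1: penalties are nonnegative.
        weights = [w * (1 - eps) ** M[i][j_t] for i, w in enumerate(weights)]
    return total / T, avg_strategy


# Matching pennies with payoffs in [0, 1]; the value of this game is 0.5.
value, strategy = approx_game_value([[1.0, 0.0], [0.0, 1.0]], delta=0.1)
```

For this game the returned value should lie between 0.5 and 0.6, and the average strategy should be close to the uniform one.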

Linear Programming and the Plotkin-Shmoys-Tardos framework 

There are various ways in which MW theory can be used to solve linear programs. Given what we developed
in the previous section, one immediate way is to cast the LP as a zero-sum game and solve it via MW.
Note that there are some interesting trade-offs between this idea and the traditional ways of solving linear
programming problems. In particular, ellipsoid and interior point (IP) algorithms achieve an error of $\delta$
in $O(\mathrm{poly}(n) \log(1/\delta))$ steps. Their dependence on the corresponding notion of the MW penalty width is
logarithmic. On the other hand, the MW algorithm achieves an error of $\delta$ after $O(\log(n)/\delta^2)$ steps when the width
is 1. Otherwise, the dependence on the width is quadratic, as we have shown. To summarize, IP algorithms
are much better with respect to the error and the size of the numbers (i.e., the width), whereas MW is much better with
respect to the dimension $n$.

We now switch focus to the Plotkin-Shmoys-Tardos framework, which is a more direct way of applying
MW to linear programming. Our goal is to check the feasibility of a set of linear inequalities,

$$Ax \ge b, \quad x \ge 0,$$

where $A = [a_1 \ldots a_m]^T$ is an $m \times n$ matrix and $x$ an $n$-dimensional vector, or more precisely to find an
approximately feasible solution $x^* \ge 0$, such that for some $\delta > 0$,

$$a_i^T x^* \ge b_i - \delta, \quad \forall i.$$

The analysis will be based on an oracle that answers the following question: given a vector $c$ and a
scalar $d$, does there exist an $x \ge 0$ such that $c^T x \ge d$? With this oracle, we will be able to repeatedly check
whether a convex combination of the initial linear inequalities, $a_i^T x \ge b_i$, is infeasible, a condition that is
sufficient for the infeasibility of our original problem. Note that the oracle is straightforward to construct,
as it involves a single inequality. In particular, it returns a negative answer exactly when $d > 0$ and $c \le 0$ componentwise.
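One possible way to write such an oracle is sketched below. The function name `oracle` and the particular witness point it returns are illustrative choices, not something fixed by the lecture: any $x \ge 0$ with $c^T x \ge d$ would do.

```python
def oracle(c, d):
    """Return some x >= 0 with c . x >= d, or None if no such x exists."""
    if d <= 0:
        # x = 0 gives c . x = 0 >= d.
        return [0.0] * len(c)
    k = max(range(len(c)), key=lambda i: c[i])
    if c[k] <= 0:
        # d > 0 and every c_i <= 0: c . x <= 0 < d for all x >= 0.
        return None
    # Put all the mass on the largest coordinate of c, so that c . x = d
    # (up to floating-point rounding).
    x = [0.0] * len(c)
    x[k] = d / c[k]
    return x
```

Note that the returned point can have a very large coordinate when the largest entry of $c$ is small, which is one way a large width $\rho$ can arise in the algorithm that uses this oracle.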

The algorithm is as follows. Experts correspond to the $m$ constraints, and events correspond
to points $x \ge 0$. The penalty of the $i$th expert for the event $x$ is $a_i^T x - b_i$, and is assumed to take
values in $[-\rho, \rho]$. Although one might expect the penalty to be the violation of the constraint, it is exactly
the opposite; the reason is that the algorithm is actually trying to prove infeasibility of the problem. In
the $t$th round, we use our distribution over experts to generate an inequality that would be valid if the
problem were feasible: if our distribution is $p_1^t, \ldots, p_m^t$, the inequality is $\sum_i p_i^t a_i^T x \ge \sum_i p_i^t b_i$. The oracle
then either detects infeasibility of this constraint, in which case the original problem is infeasible, or returns
a point $x^t \ge 0$ that satisfies the inequality. The penalty we pay is then equal to $\sum_i p_i^t (a_i^T x^t - b_i)$, and the
weights are updated accordingly. Note that when infeasibility is not detected, the penalty we pay is always
nonnegative, since $x^t$ satisfies the checked inequality.

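If after $T = O(\rho^2 \log(m)/\delta^2)$ rounds (there are $m$ experts here) infeasibility is not detected, we have the following guarantee by the MW theory:

$$0 \le \frac{\sum_t \sum_j p_j^t (a_j^T x^t - b_j)}{T} \le \delta + \frac{\sum_t (a_i^T x^t - b_i)}{T},$$

for every $i$. The first inequality follows from the nonnegativity of all penalties. If we take $\bar{x}$ to be the average
of all visited points $x^t$,

$$\bar{x} = \frac{\sum_t x^t}{T},$$

then this is our approximate solution, since from the above inequality we get for all $i$

$$0 \le \delta + a_i^T \bar{x} - b_i \implies a_i^T \bar{x} \ge b_i - \delta.$$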
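The whole procedure can be sketched as follows. To keep the width $\rho$ bounded, this sketch makes an extra assumption that is not in the text above: the search is restricted to the box $[0, B]^n$ (parameter `box`), so the oracle simply maximizes $c^T x$ over that box. The name `pst_feasibility` is hypothetical, and the choices $\epsilon = \min\{1/2, \delta/(2\rho)\}$ and $T = \lceil 4\rho^2 \ln(m)/\delta^2 \rceil$ are assumed constants for which the standard MW bound yields additive error $\delta$.

```python
import math


def pst_feasibility(A, b, delta, box):
    """Search for x in [0, box]^n with a_i . x >= b_i - delta for every i.

    Returns such a point, or None if some convex combination of the
    constraints cannot be satisfied inside the box.
    Assumes 0 < delta <= rho, where rho is the width computed below.
    """
    m, n = len(A), len(A[0])
    # Width: |a_i . x - b_i| <= rho for every x in the box.
    rho = max(sum(abs(a) for a in row) * box + abs(bi) for row, bi in zip(A, b))
    eps = min(0.5, delta / (2 * rho))
    T = max(1, math.ceil(4 * rho ** 2 * math.log(m) / delta ** 2))
    weights = [1.0] * m
    x_sum = [0.0] * n
    for _ in range(T):
        z = sum(weights)
        p = [w / z for w in weights]
        # Convex combination of the constraints: c . x >= d.
        c = [sum(p[i] * A[i][k] for i in range(m)) for k in range(n)]
        d = sum(p[i] * b[i] for i in range(m))
        # Box oracle: this x maximizes c . x over [0, box]^n.
        x = [box if ck > 0 else 0.0 for ck in c]
        if sum(ck * xk for ck, xk in zip(c, x)) < d:
            return None
        # The penalty of expert i is the slack a_i . x - b_i.
        for i in range(m):
            slack = sum(A[i][k] * x[k] for k in range(n)) - b[i]
            if slack >= 0:
                weights[i] *= (1 - eps) ** (slack / rho)
            else:
                weights[i] *= (1 + eps) ** (-slack / rho)
        x_sum = [s + xk for s, xk in zip(x_sum, x)]
    return [s / T for s in x_sum]


# Illustrative system: x1 + x2 >= 1 and -(x1 + x2) >= -1, searched over [0, 1]^2.
x_bar = pst_feasibility([[1.0, 1.0], [-1.0, -1.0]], [1.0, -1.0], delta=0.3, box=1.0)
```

On the illustrative system the returned average point should satisfy both constraints up to the slack $\delta$; a system such as $x_1 \ge 2$ over the same box is reported as infeasible.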

Boosting 

We now visit a problem from the area of Machine Learning. Suppose that we are given a sequence of training
points, $x_1, \ldots, x_N$, which are drawn from a fixed distribution $\mathcal{D}$ that is unknown to us. Alongside, we are given
corresponding 0-1 labels, $c(x_1), \ldots, c(x_N)$, assigned to each point, where $c$ is a function from some concept
class $\mathcal{C}$ that maps points to 0-1 labels. Our goal is to generate a hypothesis function $h$ that assigns
labels to points, replicating the function $c$ as well as possible. This is captured by the average absolute
error, $E_{x \sim \mathcal{D}}[|h(x) - c(x)|]$. We call a learning algorithm strong if, for every distribution $\mathcal{D}$ and any
fixed $\epsilon, \delta > 0$, it outputs with probability at least $1 - \delta$ a hypothesis $h$ that achieves error no more than $\epsilon$.
Similarly, it is called $\gamma$-weak if the error is at most $0.5 - \gamma$.

Boosting is a tool, very useful both in theory and in practice, for combining weak rules of thumb into
strong predictors. In particular, the theory of Boosting shows that if there exists a $\gamma$-weak learning algorithm
for $\mathcal{C}$, then there also exists a strong one. We will show this in the case where we have a fixed training set with $N$
points, and where the strong algorithm has a small error with respect to the uniform distribution on the
training set.

We use the MW algorithm. In the $t$th round, we assign a different distribution $D^t$ on the training set,
and use the weak learning algorithm to retrieve a hypothesis $h_t$, which by assumption has error at most
$0.5 - \gamma$ with respect to $D^t$. Our final hypothesis after $T$ rounds, $h_{\text{final}}$, is obtained by taking a majority
vote among $h_1, \ldots, h_T$. The experts in this case are the samples in the training set, and the events are the
hypotheses produced by the weak learning algorithm. The associated penalty for expert $x$ on hypothesis $h_t$
is 1 if $h_t(x) = c(x)$, and 0 otherwise. As in the previous example, we penalize the experts that "are doing
well", as we want to eventually increase the weight of a point (expert) if our hypothesis got it wrong. We
start with $D^1$ being the uniform distribution, and we update according to the MW algorithm. Finally,
after

$$T = O\left(\frac{1}{\gamma^2} \log\frac{1}{\epsilon}\right)$$

rounds we get an error rate for $h_{\text{final}}$ on the training set, under the uniform distribution, that is at most $\epsilon$,
as required.
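A small Python sketch of this boosting scheme follows. Several pieces are assumptions made for illustration only: the weak learner is taken to be the best one-dimensional threshold rule ("stump") under the current distribution, the MW parameter is set to $\gamma$, and the number of rounds is $T = \lceil (2/\gamma^2) \ln(1/\epsilon) \rceil$. The names `stump_predictions`, `weak_learner` and `boost` are hypothetical. The caller is responsible for supplying a $\gamma$ for which the stumps really are $\gamma$-weak on the data.

```python
import math


def stump_predictions(xs):
    """Labelings of xs by the rules 'x < theta' and their complements."""
    thresholds = sorted(set(xs)) + [max(xs) + 1]
    hyps = []
    for theta in thresholds:
        below = [1 if x < theta else 0 for x in xs]
        hyps.append(below)
        hyps.append([1 - v for v in below])
    return hyps


def weak_learner(hyps, labels, dist):
    """Return the stump with the smallest error under the distribution dist."""
    def err(h):
        return sum(p for p, hv, c in zip(dist, h, labels) if hv != c)
    return min(hyps, key=err)


def boost(xs, labels, gamma, eps):
    """Majority vote of T weak hypotheses chosen by MW; returns (labels, T)."""
    N = len(xs)
    hyps = stump_predictions(xs)
    T = math.ceil(2 / gamma ** 2 * math.log(1 / eps))   # assumed round count
    weights = [1.0] * N
    votes = [0] * N
    for _ in range(T):
        z = sum(weights)
        dist = [w / z for w in weights]                  # distribution D^t
        h = weak_learner(hyps, labels, dist)
        for i in range(N):
            votes[i] += h[i]
            if h[i] == labels[i]:
                # Penalty 1: shrink the weight of correctly labeled points.
                weights[i] *= 1 - gamma
    return [1 if 2 * v > T else 0 for v in votes], T


# Three points labeled 1, 0, 1: no single stump fits them, but under any
# distribution some stump has error at most 1/3, so gamma = 1/6 is valid here.
final_labels, T_used = boost([0, 1, 2], [1, 0, 1], gamma=1 / 6, eps=0.3)
```

With $\epsilon = 0.3 < 1/3$ and only three training points, an error rate of at most $\epsilon$ means the majority vote must label all three points correctly.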

Approximation Algorithms 

We conclude with an application that demonstrates how to use the MW algorithm to get $O(\log n)$-approximation
algorithms for many NP-hard problems. The problem that we will focus on is the SET COVER problem:
given a universe of elements, $U = \{1, \ldots, n\}$, and a collection $\mathcal{C} = \{C_1, \ldots, C_m\}$ of subsets of $U$ whose
union equals $U$, we want to pick a minimum number of sets from $\mathcal{C}$ to cover all of $U$. An immediate algorithm
to tackle this problem is the greedy heuristic: at each step, choose the set from $\mathcal{C}$ that has not been chosen
yet and that covers the most of the as yet uncovered elements of $U$. The MW algorithm will end up taking
exactly the form of that greedy algorithm, and will further prove the approximation bound.

We associate the elements of the universe with experts, and the sets of $\mathcal{C}$ with events. The penalty for
expert $i$ under event $C_j$, $M(i, C_j)$, will be equal to 1 if $i \in C_j$, and 0 otherwise. In this case, we use the
following simplified rule for updating the weights,

$$w_i^{t+1} = w_i^t (1 - M(i, C_j)).$$

The update rule then gives elements that are covered by the newly chosen set a weight of 0, leaving the
remaining weights unaltered. Consequently, the weight of element $i$ in round $t$ is either 0 or 1, depending on whether it has
been covered already or not. The distribution we will be using in round $t$,

$$p_i^t = \frac{w_i^t}{\sum_j w_j^t},$$

is then just the uniform distribution over the elements still uncovered at round $t$. We then choose the maximally
adversarial event (that is, the one that maximizes the penalty), which coincides with the set $C_j$ that covers
a maximum number of uncovered elements, and update our weights. The described MW algorithm coincides
with the greedy algorithm, in repeatedly picking the set that covers the most uncovered elements.
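The correspondence can be made concrete with a short sketch. The function name `greedy_set_cover` is hypothetical, the universe is indexed from 0 rather than 1 for convenience, and the sets are assumed to cover the whole universe (otherwise the loop would not terminate).

```python
def greedy_set_cover(n, sets):
    """MW / greedy set cover over the universe {0, ..., n-1}.

    sets is a list of collections of element indices whose union is assumed
    to be the whole universe. Returns the indices of the chosen sets.
    """
    weights = [1] * n            # weight 1 = uncovered, weight 0 = covered
    chosen = []
    while sum(weights) > 0:
        # The penalty of event C_k under the uniform distribution on uncovered
        # elements is proportional to the number of uncovered elements in C_k.
        j = max(range(len(sets)), key=lambda k: sum(weights[i] for i in sets[k]))
        chosen.append(j)
        for i in sets[j]:
            weights[i] = 0       # simplified update: w_i <- w_i * (1 - M(i, C_j))
    return chosen


cover = greedy_set_cover(5, [{0, 1, 2}, {2, 3}, {3, 4}, {4}])  # returns [0, 2]
```

Ties between sets are broken in favor of the lowest index here; the analysis does not depend on the tie-breaking rule.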

Consider any distribution $p_1, \ldots, p_n$ on the elements. Some OPT sets cover everything, which means
that the weights of these sets (according to the distribution $p$) sum to at least 1, and hence at least one of
them must cover at least a $1/\mathrm{OPT}$ fraction of the weight. Mathematically,

$$\max_j \sum_{i \in C_j} p_i \ge 1/\mathrm{OPT}.$$

That shows that after every round, the total weight $\Phi^t = \sum_i w_i^t$, which equals the number of uncovered elements, drops significantly:

$$\Phi^{t+1} = \Phi^t \left(1 - \max_j \sum_{i \in C_j} p_i^t\right) \le \Phi^t (1 - 1/\mathrm{OPT}) < \Phi^t e^{-1/\mathrm{OPT}}.$$

The last inequality is strict, since the penalty is always positive ($1 - x < e^{-x}$ for every $x > 0$). Using $\Phi^1 = n$, after $\mathrm{OPT} \ln n$ iterations we get
$\Phi < 1 \Rightarrow \Phi = 0$, which shows that we can cover everything with $\mathrm{OPT} \ln n$ sets, a $\ln n$ approximation.